fix(tbtc/signer): gate plaintext state, zeroize FFI buffers, allow periodic reshare by mswilkison · Pull Request #4124 · threshold-network/keep-core

mswilkison · 2026-06-30T15:39:11Z

Summary

Three findings from the #4005 review (the multi-agent custody review + Codex's review), fixed and verified.

1. [HIGH] Plaintext state fallback was accepted unconditionally — `persistence.rs`

decode_persisted_state_storage_format fell back to parsing an unencrypted, unauthenticated PersistedEngineState whenever the bytes weren't a valid AEAD envelope — with no gate at all. The AEAD envelope is the only integrity control for state-at-rest, so anyone who could write the state file could forge it (cleared consumed_* replay markers → resume-into-double-sign, or an attacker dkg_key_package → admission bypass) without holding the state-encryption key.

Per the secret-material hardening plan, the plaintext path is now emergency-rollback-only: compile-time disabled in release builds (cfg!(debug_assertions)), rejected in production profiles, and gated behind an explicit TBTC_SIGNER_PERMIT_PLAINTEXT_STATE_ROLLBACK opt-in. The two tests that load plaintext (migration + schema-mismatch) opt in, and a negative assertion pins the fail-closed default.

2. [Codex P1] Secret FFI buffers left unwiped on free — `ffi.rs`

free_buffer now zeroizes the buffer before deallocation, so secret material returned by nonce/DKG/key-package endpoints (e.g. frost_tbtc_generate_nonces_and_commitments) is wiped rather than left in allocator memory when a caller forgets to clear it. (zeroize is a volatile wipe the optimizer can't elide.)

3. [Codex P2] Periodic reshare was a one-shot — `lifecycle.rs`

After the first refresh_shares, any later reshare with updated current_shares had a different fingerprint and was rejected as SessionConflict, so a long-lived session could never advance refresh_history/refresh_count/cadence past the first refresh. A different fingerprint is now treated as a new periodic refresh; idempotent replay of the same fingerprint is unchanged.

Deferred (flagged for the team)

Signing-policy firewall production force-on (my MEDIUM finding): mirroring the provenance gate is correct in principle, but force-enabling the firewall makes its entire policy config (allowed script classes, output caps, …) mandatory in production — a deployment-contract change. The verifier flagged it as needing product confirmation, so it's left as-is pending that decision.
State-at-rest anti-rollback freshness counter and quarantine-reset replay-marker loss: design-level (needs an external monotonic anchor) / inherent to that opt-in policy (default is fail-closed).

Verification

cargo test — 297 passed, 0 failed.
cargo clippy --all-targets -- -D warnings — clean (exit 0).

Co-Authored-By: Claude Opus 4.8 (1M context) noreply@anthropic.com

🤖 Generated with Claude Code

…riodic reshare Three findings from the #4005 review (workflow review + Codex): 1. [HIGH] persistence: the legacy unencrypted plaintext state fallback was accepted unconditionally, bypassing the AEAD envelope -- anyone who could write the state file could forge it (cleared replay markers / attacker key material) without the state-encryption key. Per the secret-material hardening plan the plaintext path is now emergency-rollback-only: compile-time disabled in release builds, rejected in production profiles, and gated behind an explicit TBTC_SIGNER_PERMIT_PLAINTEXT_STATE_ROLLBACK opt-in. The two tests that load plaintext now opt in; a negative assertion pins the fail-closed default. 2. [Codex P1] ffi: free_buffer now zeroizes the buffer before deallocation, so secret material returned by nonce/DKG/key-package endpoints is wiped rather than left in allocator memory when a caller forgets to clear it. 3. [Codex P2] lifecycle: a subsequent same-session refresh with updated shares produced a different fingerprint and was rejected as SessionConflict, so a long-lived session could never advance refresh_history past the first refresh. A different fingerprint is now treated as a new periodic refresh; idempotent replay of the same fingerprint is unchanged. Deferred (flagged, not in this change): - Signing-policy firewall production force-on: force-enabling it makes the entire firewall policy config mandatory in production (a deployment-contract change); the verifier flagged it as needing product confirmation, so left as-is. - State-at-rest anti-rollback freshness counter and quarantine-reset marker loss: design-level / inherent to the opt-in policy. Verified: cargo test (297 passed, 0 failed) and cargo clippy -D warnings clean. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

coderabbitai · 2026-06-30T15:39:47Z

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: d38e6243-cdfc-4c1f-92f3-2228716f365c

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

🔍 Trigger review

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch fix/signer-review-findings

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands.}

Follow-up to the #4124 review (multi-agent + Codex), which found regressions in the periodic-refresh change and a gap in the FFI wipe: 1. Refresh stale-retry (CONFIRMED, Codex P2): removing SessionConflict made refresh re-execute on a replayed/older request -- re-deriving stale shares and bumping the epoch. Now a fingerprint already in refresh_history (accepted but no longer current) is rejected as a stale retry; a genuinely-new fingerprint still proceeds (repeatable periodic reshares). RefreshHistoryRecord gains an optional request_fingerprint (serde-default, backward compatible). 2. Unbounded refresh_history (CONFIRMED): the removed guard was the only bound. Cap per-session history at MAX_REFRESH_HISTORY (256), dropping oldest; strictly-increasing epochs preserved so continuity still holds. 3. FFI incomplete wipe (PLAUSIBLE): to_ffi_buffer's into_boxed_slice() shrinks and reallocates when capacity > len (serde_json over-allocates), freeing the original secret buffer un-wiped. Copy into an exact boxed slice and zeroize the source Vec so no un-wiped copy survives. 4. Quarantine schema-mismatch test exercised plaintext-refusal, not schema validation, after the plaintext gate landed: opt into the rollback flag like its sibling so it tests the intended path. 5. Plaintext-gate doc comment overclaimed "compile-time disabled in release"; clarify the load-bearing guard is the runtime non-production check. Verified: cargo test (298 passed) + clippy -D warnings + fmt clean. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Codex re-review: the rollback-path tests opt into the plaintext path, which legacy_plaintext_state_permitted() compiles out in release (cfg!(debug_assertions) is false), so under `cargo test --release` they fail at the post-acceptance assertions. cfg-gate the three plaintext-acceptance tests to debug_assertions; release always refuses plaintext before those assertions, so there is nothing for them to cover there. Verified: cargo test (298 passed, debug) + cargo test --release (295 passed; the 3 tests correctly compiled out) + clippy --release -D warnings clean. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Codex re-review: a host that installs an init-time config has signer_env_var read the installed config, not the process environment, so the documented TBTC_SIGNER_PERMIT_PLAINTEXT_STATE_ROLLBACK env flag had no effect for it -- the emergency plaintext migration was impossible to enable via the configured-host path. Add permit_plaintext_state_rollback to InitSignerConfigRequest and map it in config_values_from_request, mirroring the other enforcement flags. Verified: cargo test (298 passed) + clippy -D warnings + fmt clean. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Codex re-review: a session loaded from state written before RefreshHistoryRecord.request_fingerprint deserializes its history records with None, so the last accepted refresh fingerprint survives only in session.refresh_request_fingerprint. The first post-upgrade periodic refresh overwrote that value, so a delayed retry of the pre-upgrade refresh no longer matched the history scan and was accepted as a new epoch instead of SessionConflict. Before overwriting refresh_request_fingerprint, backfill the previously-accepted fingerprint onto the most-recent history record when it is not already tracked (the legacy None case). Adds a regression test that simulates pre-upgrade state (strip history fingerprints) and verifies a retry of the old refresh is rejected after a new one. Verified: cargo test (299 debug / 296 release) + clippy -D warnings + fmt clean. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Codex re-review: refresh_cadence_status reported refresh_count as session.refresh_history.len(), which the MAX_REFRESH_HISTORY cap bounds -- so a long-lived session that ran more than 256 refreshes showed a stuck count of 256 even as later refreshes succeeded. Add a monotonic SessionState.refresh_count (and its PersistedSessionState mirror, serde-default for back-compat, plus both conversions) incremented on each accepted refresh and backfilled from the retained history length for legacy state; report it from refresh_cadence_status instead of the pruned history length. Adds a regression test that prunes history and asserts the reported count is the true total. Verified: cargo test (300 passed) + clippy -D warnings + fmt clean. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

… load Codex re-review: refresh_count is #[serde(default)], so state written before the field existed deserializes it to 0 even when refresh_history holds accepted refreshes -- and since refresh_cadence_status now reports refresh_count, an upgraded node reported 0 until the next refresh. Backfill it from refresh_history.len() in the Persisted->Session conversion (evaluated before refresh_history is moved), with a regression test. Verified: cargo test + clippy -D warnings + fmt clean. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…ntine recheck) Codex re-review of the signer FFI surface: 1. [P1] Secret request fields not zeroized (api.rs): SignShareRequest.nonces_hex and key_package_hex deserialized into plain Strings dropped without zeroization, retaining the one-time nonce / private key package in freed heap. Wrap both in Zeroizing<String> (wiped on drop) + a redacting Debug so the secrets no longer leak via {:?} either. 2. [P1] Panic hook defeats payload redaction (ffi.rs): catch_unwind does not suppress Rust's default panic hook, so a panic payload (path/config/secret) prints to stderr before being converted to the redacted ErrorResponse. Install a panic hook at the FFI boundary that redacts the payload outside development. 3. [P2] Quarantine not rechecked on cached-round reuse (signing.rs): the matched- fingerprint reuse branch returned a fresh share without the new-round path's enforce_not_quarantined_identifiers check, so a participant quarantined after the round opened could still obtain a share. Apply the check on reuse too. Verified: cargo test (301 passed) + clippy -D warnings + fmt clean. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…pty history Codex re-review: legacy state written before refresh_history existed can carry refresh_request_fingerprint/refresh_result with an EMPTY refresh_history. The fingerprint backfill used refresh_history.last_mut(), which is None for empty history, so the previous fingerprint was not recorded and a delayed retry of that pre-upgrade refresh was re-executed as a new refresh. When there is no record to backfill onto, synthesize one from the cached refresh_result (prior epoch, keeping history strictly increasing) carrying the previous fingerprint, so stale retries stay rejected. Adds a regression test for the empty-history legacy shape. Verified: cargo test (302 passed) + clippy -D warnings + fmt clean. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…h env Two Codex-flagged issues in the non-interactive signing path: 1. start_sign_round fingerprinted and derived the round id from the raw request.message_hex string. hex::decode accepts mixed casing, so two calls carrying identical message bytes but different hex casing derived different fingerprints/round ids and a semantically identical retry was rejected as a SessionConflict. Lowercase message_hex right after it validates as hex, mirroring the interactive path (interactive.rs), so the fingerprint and derive_round_id are casing-independent. 2. The README-documented `cargo bench --features bench-restart-hook --bench phase5_roast` failed in a clean shell: ensure_benchmark_environment never set TBTC_SIGNER_PROFILE (missing -> production) or TBTC_SIGNER_STATE_ENCRYPTION_KEY_HEX (required by the default `env` state-key provider), so the first RunDkg persist aborted. Seed profile=development, provider=env, and a fixed dev key. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The prior commit lowercases message_hex before computing every canonical and legacy fingerprint. That regressed the upgrade path: when a node upgrades with an active sign round that a previous build started using mixed/uppercase message_hex, the persisted sign_request_fingerprint was computed over that original casing. The same retry then matched none of the fingerprints and fell through to SessionConflict instead of reusing the cached round. Capture the pre-canonicalization casing and, only when it differs from the lowercased form, compute the four legacy fingerprint shapes over the original casing and accept them in the cached-round match. A match migrates the stored fingerprint to the canonical lowercase form, so the compatibility cost is paid once per in-flight round. Adds a regression test that persists a mixed-case fingerprint and asserts the retry reuses the round and migrates it. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…d result The empty-history synthesize branch was guarded by `else if let Some(previous_result)`, so a legacy/degraded state that kept refresh_request_fingerprint but whose refresh_result deserialized to None (and whose refresh_history is empty) fell through: neither the last_mut() backfill nor the synthesize branch ran. The next refresh then overwrote the only copy of the old fingerprint, so a delayed retry of the pre-upgrade request was treated as fresh and advanced the epoch — a replay. Make the branch unconditional: synthesize a history record carrying the previous fingerprint whether or not a cached result is present. Prefer the result's epoch (when non-zero and below the new refresh) and share count; otherwise fall back to refresh_epoch-1 (min 1) and a zero share count. refresh_epoch_counter is persisted, so a prior accepted refresh implies refresh_epoch >= 2 and the fallback stays non-zero and strictly below the new record, keeping history monotonic. Adds a regression test (empty history + refresh_result=None) verified to fail against the pre-fix code (the stale retry was re-executed instead of rejected). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

mswilkison and others added 11 commits June 30, 2026 13:28

mswilkison merged commit 01e274b into extraction/frost-signer-mirror-2026-05-26 Jul 1, 2026
20 checks passed

mswilkison deleted the fix/signer-review-findings branch July 1, 2026 01:57

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix(tbtc/signer): gate plaintext state, zeroize FFI buffers, allow periodic reshare#4124

fix(tbtc/signer): gate plaintext state, zeroize FFI buffers, allow periodic reshare#4124
mswilkison merged 12 commits into
extraction/frost-signer-mirror-2026-05-26from
fix/signer-review-findings

mswilkison commented Jun 30, 2026

Uh oh!

coderabbitai Bot commented Jun 30, 2026 •

edited

Loading

Review skipped

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

mswilkison commented Jun 30, 2026

Summary

1. [HIGH] Plaintext state fallback was accepted unconditionally — persistence.rs

2. [Codex P1] Secret FFI buffers left unwiped on free — ffi.rs

3. [Codex P2] Periodic reshare was a one-shot — lifecycle.rs

Deferred (flagged for the team)

Verification

Uh oh!

coderabbitai Bot commented Jun 30, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Review skipped

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

1. [HIGH] Plaintext state fallback was accepted unconditionally — `persistence.rs`

2. [Codex P1] Secret FFI buffers left unwiped on free — `ffi.rs`

3. [Codex P2] Periodic reshare was a one-shot — `lifecycle.rs`

coderabbitai Bot commented Jun 30, 2026 •

edited

Loading